AI - Project 4

Ali Javidan ( 810896047 )

0.1

0.2

0.3

OverallQual, GrLivArea, GarageCars, GarageArea

0.4

After applying log10 to the target feature, the scores appear to have improved.
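A minimal sketch of the transform, assuming a right-skewed SalePrice-like target (the values here are hypothetical):

```python
import numpy as np

# Hypothetical skewed target values (house prices are typically right-skewed)
sale_price = np.array([120000.0, 150000.0, 180000.0, 300000.0, 755000.0])

# log10 compresses the long right tail, making the target closer to normal
log_price = np.log10(sale_price)

# Predictions made on the log scale are mapped back with the inverse transform
restored = 10 ** log_price
```

Models are then trained against `log_price`, and their predictions are inverted with `10 ** pred` before scoring on the original scale.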

0.6

1.1

Deleting Rows with missing values

Missing values can be handled by deleting the rows or columns that contain them. If more than half of a column's rows are null, the entire column can be dropped. Rows in which one or more column values are null can also be dropped.
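The two deletion rules above can be sketched with pandas on a toy frame (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": [np.nan, np.nan, np.nan, 1.0],  # mostly null, so the column is dropped
    "C": [5.0, 6.0, 7.0, 8.0],
})

# Drop columns where more than half of the rows are null
df = df.loc[:, df.isnull().mean() <= 0.5]

# Drop any remaining rows that still contain a null value
df = df.dropna()
```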

Pros: simple and fast to apply; the remaining data is complete, so any model can be trained on it directly.

Cons: discards potentially useful information and shrinks the dataset; if values are not missing completely at random, deletion can bias the model.

Mean/Median/Mode Imputation

In this method, any missing values in a given column are replaced with the mean (or median, or mode) of that column. This is the easiest to implement and comprehend.

Regression Imputation

This approach replaces missing values with values predicted from a regression line. Regression is a statistical method that models the relationship between a dependent variable and one or more independent variables. In the simple linear case it is expressed as y = mx + b, where m is the slope, b is the intercept, x is the independent variable, and y is the dependent variable.
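A sketch of regression imputation with scikit-learn, using two hypothetical correlated columns where one has gaps:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical columns: garage_area is complete, garage_cars has gaps
garage_area = np.array([200.0, 400.0, 600.0, 800.0, 500.0])
garage_cars = np.array([1.0, 2.0, 3.0, np.nan, np.nan])

observed = ~np.isnan(garage_cars)

# Fit y = m*x + b on the observed pairs only
model = LinearRegression()
model.fit(garage_area[observed].reshape(-1, 1), garage_cars[observed])

# Fill the missing entries with predictions from the regression line
garage_cars[~observed] = model.predict(garage_area[~observed].reshape(-1, 1))
```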

1.2

PoolQC : 99.52 %
MiscFeature : 96.30 %
Alley : 93.77 %
Fence : 80.75 %
FireplaceQu : 47.26 %
LotFrontage : 17.74 %
GarageType : 5.55 %
GarageYrBlt : 5.55 %
GarageFinish : 5.55 %
GarageQual : 5.55 %
GarageCond : 5.55 %
BsmtExposure : 2.60 %
BsmtFinType2 : 2.60 %
BsmtQual : 2.53 %
BsmtCond : 2.53 %
BsmtFinType1 : 2.53 %
MasVnrType : 0.55 %
MasVnrArea : 0.55 %
Electrical : 0.07 %

Frequent Categorical Imputation

Assumptions: data is Missing At Random (MAR) and the missing values resemble the majority.

Description: replace NaN values with the most frequently occurring category in the variable/column.
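A sketch of frequent-category imputation with pandas, on a made-up categorical column:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with a missing entry
electrical = pd.Series(["SBrkr", "SBrkr", "FuseA", np.nan, "SBrkr"])

# The mode is the most frequent category in the column
most_frequent = electrical.mode()[0]

# Replace every NaN with that category
filled = electrical.fillna(most_frequent)
```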

1.3

Normalization typically means rescaling the values into the range [0, 1]. Standardization typically means rescaling the data to have a mean of 0 and a standard deviation of 1 (unit variance).

Normalization

Normalization is a good technique to use when you do not know the distribution of your data or when you know the distribution is not Gaussian (a bell curve). Normalization is useful when your data has varying scales and the algorithm you are using does not make assumptions about the distribution of your data, such as k-nearest neighbors and artificial neural networks.

Standardization

Assumes that your data has a Gaussian (bell curve) distribution. This does not strictly have to be true, but the technique is more effective if your attribute distribution is Gaussian. Standardization is useful when your data has varying scales and the algorithm you are using does make assumptions about your data having a Gaussian distribution, such as linear regression, logistic regression, and linear discriminant analysis.

As mentioned above, both normalization and standardization are useful when the data has varying scales. It is therefore better to apply normalization when using k-nearest neighbors regression, since that method makes no assumption that the data is Gaussian, and to apply standardization when using linear regression, since that method does assume a Gaussian distribution.
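Both rescalings are available in scikit-learn; a minimal sketch on a single toy column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])

# Normalization: rescale into [0, 1] (suited to k-nearest neighbors here)
X_norm = MinMaxScaler().fit_transform(X)

# Standardization: zero mean, unit variance (suited to linear regression here)
X_std = StandardScaler().fit_transform(X)
```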

1.4

Numerical Feature Encoding

Binary Feature Encoding

Binary features are those with only two possible values.

Ordinal Feature Encoding

Ordinal features are those with some order associated with them.

Nominal Features

Nominal features are categorical features with no intrinsic order or numerical meaning; order does not matter.
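A sketch of all three encodings with pandas; the columns and category orderings are illustrative (ExterQual's quality scale follows the usual Po < Fa < TA < Gd < Ex convention):

```python
import pandas as pd

df = pd.DataFrame({
    "CentralAir": ["Y", "N", "Y"],                   # binary: two values
    "ExterQual": ["TA", "Gd", "Ex"],                 # ordinal: ordered quality
    "Neighborhood": ["NAmes", "OldTown", "NAmes"],   # nominal: no order
})

# Binary: map the two values to 0/1
df["CentralAir"] = df["CentralAir"].map({"N": 0, "Y": 1})

# Ordinal: map categories to integers that preserve their order
df["ExterQual"] = df["ExterQual"].map({"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4})

# Nominal: one-hot encode, since the categories have no order
df = pd.get_dummies(df, columns=["Neighborhood"])
```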

1.5

Mutual Information

We drop both Street and CentralAir because of their poor information gain.

We ignore nominal features with an information gain below 0.1.
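The screening step can be sketched with scikit-learn's mutual information estimator on synthetic data (one informative feature, one noise feature; the 0.1 threshold is the one used above):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.RandomState(0)
y = rng.rand(200)

# One feature closely tied to the target, one pure-noise feature
X = np.column_stack([y + 0.01 * rng.randn(200), rng.rand(200)])

mi = mutual_info_regression(X, y, random_state=0)

# Keep only features whose information gain clears the threshold
keep = mi >= 0.1
```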

1.6

Evaluation Metrics

Here I use three metrics for evaluation: Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). In our problem the MSE is very large and not very informative. Regarding MAE and RMSE:

They are similar, but differ in one important way:

Differences: taking the square root of the average squared errors has interesting implications for RMSE. Since the errors are squared before they are averaged, RMSE gives relatively high weight to large errors, which means RMSE is more useful when large errors are particularly undesirable. For a fixed MAE, RMSE increases as the variance of the error magnitudes increases.

In our problem, because the target values are very large, I think it is better to work with MAE.
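A small worked comparison of the three metrics (toy predictions, chosen so one error dominates):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([100000.0, 200000.0, 300000.0])
y_pred = np.array([110000.0, 190000.0, 330000.0])  # errors: 10k, 10k, 30k

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

# RMSE is always >= MAE; the gap reflects how uneven the error magnitudes are
```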

K Nearest Neighbors Regression

Neighbors-based regression can be used in cases where the data labels are continuous rather than discrete variables. The label assigned to a query point is computed from the mean of the labels of its nearest neighbors. Here we tune the n_neighbors parameter to find the best number of neighbors for our model.
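The tuning step can be sketched with a cross-validated grid search over n_neighbors (synthetic data; the candidate values are illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 1)
y = 3.0 * X.ravel() + 0.1 * rng.randn(100)

# Cross-validate each candidate neighborhood size and keep the best
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)
best_k = search.best_params_["n_neighbors"]
```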

Decision Tree Regression

Decision Trees are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Here, I tune the max_depth parameter to find the best depth for the model. As the results show, beyond a depth of five the model starts to overfit, so I use 5 as the best depth for our model.
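Overfitting with depth can be spotted by comparing train and test error at each candidate depth; a sketch on synthetic data:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 1)
y = np.sin(4 * X.ravel()) + 0.2 * rng.randn(200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Record (train error, test error) at each depth; a growing gap means overfitting
scores = {}
for depth in range(1, 11):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (
        mean_absolute_error(y_tr, tree.predict(X_tr)),  # train error
        mean_absolute_error(y_te, tree.predict(X_te)),  # test error
    )
```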

Linear Regression

The simplest way of training a regression model is the Linear Regression algorithm. This algorithm has no specific parameter to tune. LinearRegression fits a linear model with coefficients w to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation. Mathematically it solves a problem of the form min_w ||Xw - y||^2.

The result is not too bad, but we should use more complex models because the relationship in our data is not linear.
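A minimal sketch of the least-squares fit on toy data that follows y = 2x + 1 exactly, so the recovered coefficients are easy to check:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1 exactly
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Ordinary least squares: minimize ||Xw - y||^2 over the coefficients w
model = LinearRegression().fit(X, y)
```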

Random Forest Regression

In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. Furthermore, when splitting each node during the construction of a tree, the best split is found either from all input features or from a random subset of size max_features. Here, I tune the max_depth parameter to find the best depth for the model. As the results show, deeper trees give better results up to a point, and I use 4 as the best depth for our model.
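The depth sweep can be sketched as follows (synthetic data; the candidate depths and n_estimators are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = 3 * X[:, 0] + np.sin(5 * X[:, 1]) + 0.1 * rng.randn(200)

# Each tree sees a bootstrap sample; max_depth caps how far each tree can grow
errors = {}
for depth in [2, 4, 8]:
    forest = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=0)
    forest.fit(X, y)
    errors[depth] = mean_absolute_error(y, forest.predict(X))
```

In practice the depth should be chosen on held-out data, since training error alone keeps improving with depth.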